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Abstract 

This paper studies fundamental aspects of modelling data using multivariate Watson distribu- 
tions. Although these distributions are natural for modelling axially symmetric data (i.e., unit 
vectors where ±05 are equivalent), for high-dimensions using them can be difficult — largely because 
for Watson distributions even basic tasks such as maximum-likelihood are numerically challeng- 
ing. To tackle the numerical difficulties some approximations have been derived. But these are 
either grossly inaccurate in high-dimensions (Directional Statistics, Mardia & Jupp. 2000) or when 
reasonably accurate (J. Machine Learning Research, W.& CP., v2, Bijral et at, 2007, pp. 35- 
42), they lack theoretical justification. We derive asymptotically precise two-sided bounds for the 
maximum-likelihood estimates which lead to new approximations. Our approximations are the- 
oretically well-defined, numerically accurate, and easy to compute. We build on our parameter 
estimation and discuss mixture-modelling with Watson distributions; here we uncover a hitherto 
unknown connection to the "diametrical clustering" algorithm of Dhillon et al. (Bioinformatics, 
f9(f3), 2003, pp. 1612-1619). 

Keywords: Watson distribution, Kummer function, Confluent hypergeometric function, Directional 
statistics, Diametrical clustering, Special function, Hypergeometric identity 

1 Introduction 

Life on the surface of the unit hypersphere is more twisted than you might imagine: designing elegant 
probabilistic models is easy but using them is often not. This difficulty usually stems from the com- 
plicated normalising constants associated with directional distributions. Nevertheless, owing to their 
powerful modelling capabilities, distributions on h yperspheres contin ue finding numerous applications — 
see e.g., the excellent book Directional Statistics (M ardia and JuppI . |2000T) . 

A fundamental directional distribution is the von Mises-Fisher (vMF) distribution, which models 
data concentrated around a mean-direction. But for data that have additional structure, vMF can 
be inappropriate: in particular, fo r axially symmetr i c data it is more na tural to prefer the (Dimroth- 
Scheidegger)-Watson distribution ( Mardia and JuppI . 200(1 Watson . 19651) . And this distribution is the 



focus of our paper. 

Three main reasons motivate our study of the multivariate Watson (mW) distribution, namely: 
(i) it is fundamental to directional statistics; (ii) it has not received much attention in modern data- 
analysis setups involving high-dimensional data; and (iii) it prov ides a theoretical ba sis to "diametrical 



clustering", a procedure developed for gene-expression analysis ([Dhillon et all 12003). 

Somewhat surprisingly, for high-dimensional settings, the mW distribution seems to be fairly under- 
studied. One reason might be that the traditional domains of directional statistics are low-dimensional, 
e.g., circles or spheres. Moreover, in low-dimensions numerical difficulties that are rife in high-dimensions 



* This work was largely done when the first author was affiliated with the Max Planck Institute for Biological Cybernetics 
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are not so pronounced. This paper contributes theoretically and numerically to the study of the mW 
distribution. We hope that these contributions and the connections we make to established applications 
help promote wider use of the mW distribution. 



1.1 Related Work 



Beyond their use in typical applications of directional statistics (jMardia and Jupp1 . I2000I) . directional 
dist ributions gained renewed a ttention in data-mining, where the vMF distribution was first used 
by (jBaneriee et all 120031 120051 ). who also derived so me ad-hoc paramete r estimates; Non ad-hoc pa- 
rameter estimates for the vMF case were obtained by iTanabe et al 

Mor e recently, the Wa tson distribution was considered in llBiiral et all 12007)) and also in (Sra, 

20071 ). iBiiral et al.l (|2007l ) used an approach similar to that of (jBaneriee et all 120031 ) to obtain an 



ad-hoc approximation to the maximum-likelihood estimates. We eliminate the ad-hoc approach and 
formally derive tight, two-sided bounds which lead to parameter approximations that are accurate and 
efficiently computed. 

Our derivations are based on carefully exploiting properties (several new ones are derived in this 
paper) of the confluent hypergeometric function, which arises as a part of the normalisation constant. 
Consequently, a large body of classical work on special functions is related to our paper. But to avoid 
detracting from the main message limitations, we relegate highly technical details to the appendix. 

Another line of related work is b ased on mixture - model ling with directional distributions, especially 



for high-dimensional datasets. In (jBaneriee et all [2005), mixture-modelling using the Expectation 



Maximisation (EM) al gorithm for mixtures of vMFs was related to cosine-similarity based K-means 
clustering. Specifically. iBaneriee et al.l ( 20051 ) showed how the cosine based K-means algorithm may be 
viewed as a limiting case of the EM algorithm for mixtures of vMFs. Similarly, we investigate mixture- 
modelling using Watson distributions, and connect a limi ting case of the corr esponding EM procedure 
to a clustering algorithm called "diametrical clustering" (jDhillon et all 120031 ). Our viewpoint provides 
a new interpretation of the (discriminative) diametrical clustering algorithm and also lends generative 
semantics to it. Consequently, using a mixture of Watson distributions we also obtain a clustering 
procedure that can provide better clustering results than plain diametrical clustering alone. 

2 Background 

Let S p_1 = {x | x e M p , ||a?||2 = 1} be the (p— l)-dimensional unit hypersphere centred at the origin. 
We focus on axially symmetric vectors, i.e., ±jc E S p_1 are equivalent; this is also denoted by x E P p_1 , 
where P p_1 is the projective hyperpla ne of dimension p— 1. A natural choice for modelling such data is 
the multivariate Watson distribution (jMardia and JuppI . l2000h . This distribution is parametrised by a 
mean- direction /x E P p_1 , and a concentration parameter k£1; its probability density function is 



W p {x;^,k) =c p {K)e K ^ T ^\ 
The normalisation constant c p (k) in (j2.ip is given by 

r(p/2) 



x E 



M>— 1 



27T p /2M(if, K ) ; 



(2.1) 



(2.2) 



wh ere M is Rummer's co nfluent hypergeometric function defined as ( (jErdelvi et allll953l formula 6.1(1)) 
or (jAndrews et all Il999i formula (2.1.2)) 



M(a,c, k) = ^2 



j>0 



d> J 



a.c, k £ 



(2.3) 



and a = 1, a J = a(a + 1) • • • (a + j — 1), j > 1, denotes the rising-factorial. 

Observe that for k > 0, the density concentrates around fi as k increases, whereas for k < 0, it 
concentrates around the great circle orthogonal to fi. Observe that (Qfi) T Qx — fi T x for any orthogonal 
matrix Q. In particular for Qfi = fi, fi T (Qx) = fi T x; thus, the Watson density is rotationally 
symmetric about fi. 
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2.1 Maximum Likelihood Estimation 



We now consider the basic and apparently simple task of maximum-likelihood parameter estimation for 
mW distributions: this task turns out to be surprisingly difficult. 

Let Xi,...,£C„ € pp^ 1 be i.i.d. points drawn from W p (x; fi, k), the Watson density with mean /x 
and concentration k. The corresponding log-likelihood is 



£(H,k;xi, . . . ,x n ) = ti(kh t S> - log M(l/2,p/2, n) + 7), 



(2.4) 



where S = rt _1 X)"=i x i x f ^ s the sample scatter matrix, and 7 is a constant term tha t we can ignore. 
Maximising (j2.4[) leads to the following parameter estimates ( Mardia and Jupp . 2Q00l Sec. 10.3.2) for 
the mean vector 

ft = s\ if k > 0, ft = s p if k < 0, (2.5) 

where sx,...,s p are normalised eigenvectors (g P p_1 ) of the scatter matrix S corresponding to the 
eigenvalues Ai > A2 > • • • > X p . The concentration estimate k is obtained by solving 



9(2' I ' *0 : ~ 



M% 



2, k) 



(0 < r < 1) 



(2.6) 



where M' denotes the derivative with respect to k. Notice that (|2.5[) and (|2.6p are coupled — so we need 
some way to decide whether to solve g(l/2,p/2; k) = Ai or to solve g(l/2,p/2; k) = X p instead. An easy 
choice is to solve both equations, and select the solution that yields a higher log-likelihood. Solving 
these equations is much harder. 

One could solve (|2.6I) using a root-finding method (e.g. Newton- Raphson) . But, the situation is not 
that simple. For reasons that will soon become clear, an out-of-the-box root-finding approach can be 
unduly slow or even fraught with numerical peril, effects that become more pronounced with increasing 
data dimensionality. Let us, therefore, consider a slightly more general equation (we also drop the 
accent on k): 



Solve for k 

M'(a,c;K) 

g{a, c; n) r = r 

M{a, c; K) 

c>a>0, < r < 1. 



(2.7) 



3 Solving for k 

In this section we present two different solutions to (|2.7p . The first is the "obvious" method based on 
a Newton-Raphson root-finder. The second method is the key numerical contribution of this paper: 
a method that computes a closed- form approximate solution to (|2.7[) . thereby requiring merely a few 
floating-point operations! 



3.1 Newton-Raphson 

Although we establish this fact not until Section 13.21 suppose for the moment that (|2.7[) does have a 
solution. Further, assume that by bisection or otherwise, we have bracketed the root K to be within an 
interval and are thus ready to invoke the Newton-Raphson method. 

Starting at Ko, Newton-Raphson solves the equation g(a, c; k) — r = by iterating 

g(a,c-,K n ) -r 

K n+ i = k h — , n = 0, 1, — (3.1) 

g'(a,c; K n ) 



1 More precisely, we need Ai > A2 to ensure a unique m.l.e. for positive k, and A p _i > A p , for negative re. 
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This iteration may be simplified by rewriting g 1 (a, c; k). First note that 



j'(a,c; k) = 



M"(a,c)n) fM'(a,c;K) 



M(a, b; k) \M(a,c;n) 
then, recall the following two identities 

a(a + 1) 



M"(a, c; k) 
M(a + 2, c + 2; «;) 



c(c+l) 
(c+ l)(-c+ k) 



M(a + 2,c + 2;n) 



(c + T)c 

M (a + 1, c + 1; k) + 7 ^-M(a, c; re). 

(a + 1)k 



(a + 1)k 

Now, use both (I3.3|) and (|3 .4[) to rewrite the derivative (|3.2I) as 

g'(a, c; «;) = (!- c/ K;)#(a, c; k) + (a/re) - (p(a, c; k)) 2 . 



(3.2) 

(3.3) 
(3.4) 

(3.5) 



The main consequence of these simplifications is that iteration (|3.1|) can be implemented with only 
one evaluation of the ratio g(a,c;K n ) — M'(a,c; K n )/M(a,c; K n ). Efficiently comput ing this ratio is 
a non-trivial ta sk in itself; an insight into this difficulty is offered by observations in ( Gautschi , Il977t 
iGil et all [20071) . In the worst case, one may have to compute the numerator and denominator separately 
(using multi-precision floating point arithmetic), and then divide. Doing so can require several million 
extended precision floating point operations, which is very undesirable. 

3.2 Closed-form Approximation for (12. 71) 

We now derive two-sided bounds which will lead to a closed-form approximation to the solution of (|2.7[) . 
This approximation, while marginally less accurate than the one via Newton-Raphson, should suffice 
for most uses. Moreover, it is incomparably faster to compute as it is in closed-form. 

Before proceeding to the details, let us look at a little history. For 2-3 dimensional data, or under ver 



ry 

restri ctive assumptions on k or r, some approximations had been previously obtained (IMardia and J upp. 
2000). Due to their restrictive assumptions, these approximations have li mited applicabili t y, esp ecially 



for hi gh-dimension a l data , where these assumptions are often viola t ed (Baneriee et al. I I2OO5I) . Re 
centlv TBiiral et al" ( 20071 ) followed the technique of Baneriee et 
ad-hoc approximation (actually particularly for the case a = 1/2) 



BBG(r) := 



•(1-r) 2c(l-r) : 



(3.6) 



which they observed to be quite accu rate. However, (|3.6I) lacks theoretical justification; other approxi- 
mations were presented in (|Sral . 120071 ). though again only ad-hoc. 

Below we present new approximations for k that are theoretically well-motivated and also numer- 
ically more accurate. Key to obtaining these approximations are a set of bounds localizing k, and we 
present these in a series of theorems below. The proofs are given in the appendix. 

3.2.1 Existence and Uniqueness 

The following theorem shows that the function g(a, c; k) is strictly increasing. 

Theorem 3.1. Let c > a > 0, and k G M. Then the function k — > g(a,c;n) is monotone increasing 
from g(a, c; —00) = to g(a, c; 00) = 1. 

Proof. Since g(a, c; k) — (a/c)fi(n), where is defined in (|A.11[) . this theorem is a direct consequence 
of Theorem El □ 

Hence the equation g{a, c; k) = r has a unique solution for each < r < 1. This solution is negative 
if < r < a/c and positive if a/c < r < 1. Let us now localize this solution to a narrow interval by 
deriving tight bounds on it. 
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3.2.2 Bounds on the solution k 

Deriving tight bounds for k is key to obtaining our new theoretically well-defined numerical approxi- 
mations; moreover, these approximations are easy to compute because the bounds are given in closed 
form. 

Theorem 3.2. Let the solution to g(a,c;n) — r be denoted by n(r). Consider the following three 
bounds: 

(lower bound) L(r) — —- ( 1 H I , (3-7) 



rc — 




r(l - 


r) ( 


rc - 


a 


2r(l - 


-r) 


rc — 


a ( 




(bound) B(r)= '" - J 1 + J 1 + 4( ° + 1)r(1 r) ] (3.8) 

' 1 1 a(c — a) J 

(upper bound) U(r) = — ( 1 H — ) . (3-9) 

r(l — r) 

Let c > a > 0, and n(r) be the solution ()2.7[) . Then, we have 

1. for a/c < r < 1, 

L(r) < n(r) < B(r) < U(r), (3.10) 

2. for0<r< a/c, 

L(r) < B(r) < «(r) < U(r). (3.11) 

3. and if r = a/c, then n(r) = L(a/c) = B(a/c) = U(a/c) = 0. 

All three bounds (L, B, and U ) are also asymptotically precise at r = and r = 1. 

Proof. The proofs of parts 1 and 2 are given in Theorems IA.5I and IA.6I (see Appendix) , respectively. 
Part 3 is trivial. It is easy to see that lim r ^ ,i U(r)/L(r) — 1, so from inequalities p.lOp and p. lip , it 
follows that 

L(r) B(r) U(r) 

urn — — = urn — — - = lim — -— = 1. U 

r->0,l K(r) r-yO,l K{r) r->0,l n(r) 

More precise asymptotic characterizations of the approximations L, B and U are given in sec- 
tion OEM 

3.2.3 BBG approximation 

Our bound s above also provid e some insight into the previous heuristically motivated K-approximation 
BBG(r) of iBiiral et al.l (|2007l ) given by (I3.6p . Specifically, we check whether BBG(r) satisfies the lower 
and upper bounds from Theorem 13.21 

To see when BBG(r) violates the lower bound, solve L(r) > BBG(r) for r to obtain 

2c 2 + a - v / (2c 2 -a)(2c 2 -a-8ac) 2c 2 + a+ y/(2c 2 - a)(2c 2 -a-8ac) 

2(2c 2 -a + c) < T < 2(2c 2 -a + c) ' 

For the Watson case a — 1/2; this means that BBG(r) violates the lower bound and underestimates 
the solution for r € (0.11,0.81) if c = 5; for r g (0.0528,0.904) if c = 10; for r € (0.00503,0.99) if 
c = 100; for r € (0.00050025, 0.999) if c = 1000. This fact is also reflected in Figure [2J 
To see when BBG(r) violates the upper bound, solve BBG(r) > U(r) for r to obtain 

2ac 



r < 



2c 2 



For the Watson case a = 1/2; this means that BBG(r) violates the upper bound and overestimates the 
solution for r € (0, 0.1) if c = 5; for r € (0, 0.05) if c = 10; for r G (0, 0.005) if c = 100; for r € (0, 0.0005) 
if c = 1000. 

What do these violations imply? They show that a combination of L(r) and U(r) is guaranteed to 
give a better approximation than BBG(r) for nearly all r £ (0, 1) except for a very small neighbourhood 
of the point where BBG(r) intersects n(r). 
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3.2.4 Asymptotic precision of the approximations 

Let us now look more precisely at how the various approximations behave at limiting values of r. There 
are three points where we can compute asymptotics: r = 0, r = a/c, and r = 1. First, we assess how 
n(r) itself behaves. 

Theorem 3.3. Let c > a > 0, r G (0, 1); let n(r) be the solution to g(a, c; k) = r. Then, 

a , (c — a — l)(l + a) „, ,, , 
«(r) = + (c-a-l) + - -r + 0{r 2 ), r -> 0, (3.12) 

"<-» - (- - ; ) {f^y + ffiffi^ °(fr -;)*)} ■ <" 3 > 

. , c— a (a — l)(a — c — 1) , „ .n. 
«(r) = - + l-a+- -0--r) + 0((1 -r) 2 ), r -> 1. (3.14) 

1—?* c — a 

This theorem is given in the appendix with detailed proof as Theorem IA.71 

We can compute asymptotic expansions for the various approximations by standard Laurent expan- 
sion. For L(r) we obtain: 

_a(c-^+l) +c r 

(c — a)r 

r , . c 2 (c+l), , c 3 (ac - (c + l)(c — a)) , „ , .,. 

L r = — i f (r - a/c) + -i ^ ^ ^(r - a/c) 2 + 0((r - a/c) 3 ), r -> a/c, 

a(c — a) a z (c — ay 



ti \ c — a a(a — c— 1) . , 2s 

£(r) = + 1 - a + -i ^ 1 - r) + O 1 - r) 2 ), r -> 1. 

1 — r c — a 



For J7(r) we get: 



a . (c — a)(a + l) o N 

C/(r = + (c-a-l) + i A V + 0(r 2 ), r 0, 

r a 



„, . c 2 (c+l). , , c 3 (2ac + a — c 2 ) , , , 2 , , 

£/ r = -7 f (r - «/c) + ~ a/c) 2 + Or - a/c) 3 ), r -> a/c, 

a(c — a) a (c — a) z 

(^/aj-a + c-l) _ ((c/a) + a)+(9(1 _ r)) r _> L 
1 — r 

For -B(r) the expansions are: 

. a a 2 — 2ac + c 2 — c — 1 , 

B(r) = 1 + 0(r), r^0, 

r c — a 

^/ \ c 2 (l + c )/ / \ c 3 (l + c) 2 (2a - c) . , , 2 . ... , 

B(r) = -j Mr - a/c + -± ' v ; r - a/c 2 + O r - a/c) 8 ), r -> a/c, 

a(c — a) a^(c — a) z (c + 2) 

„ , , c — a c+1— a 2 „ , 

B(r) = - + + 1-r, r->l. 

1 — r a 

Finally, for the approximation (13.61) we have the following expansions: 

BBG{r) = -- + {c-a) + 0{r) 1 r -> 0, 
r 

BBG(r)= a + 0(r-a/c), r -> a/c, 
2c(c — a) 

, 2ac-2c 2 -l 2ac+l _ 



We summarize the results in Table [T] below. 



G 



Table [T] uses the following terminology: (i) we call an approximation f(r) to be incorrect around 
r = a, if f(r) / n(f) — > 0, oo as r — > a; (ii) we say /(r) is correct of order 1 around r = a, if /(r) / /t(r) — > C 
such that C ^ 0,co as r -> a; (hi) we say /(r) is correct of order 2 around r = a if f{r)/n{r) = 
1 + 0(r — a) as r — > a; and (iv) f(r) is correct of order 3 around r = a if f(r)/n(r) = 1 + 0((r — a) 2 ) 
as r — > a. 

No matter how we count the total "order of co rrectness" it is clear from Table Q] that our approxi- 
mations are superior to that of (jBiiral et al. L l2007h . 

The table shows that actually L(r) and U(r) can be viewed as three-point [2/2] Pade approximations 
to n{r) at r = and r — a/c and r — 1 with different orders at different points, while B(r) is a special 
non-rational three point approximation with even higher total order of contact. 

Moreover, since we not only give the order of correctness but also prove the inequalities, we always 
know exactly which approximation underestimates n(r) and whi ch overestimates n( r). Such information 
might be important to some applications. The approximation of ( Biiral et al. , 20071 ) is clearly less precise 
and does not satisfy such inequalities. Also, note that all the above facts are equally true in the Watson 
case a = 1/2. 



Approx. 

Point 


L(r) 


B(r) 


U(r) 


BBG(r) 


r = 


Correct of 
order 1 


Correct of 
order 2 


Correct of 
order 3 


Correct of 
order 2 


r — a/c 


Correct of 
order 2 


Correct of 
order 3 


Correct of 
order 2 


Incorrect 


r = 1 


Correct of 
order 3 


Correct of 
order 2 


Correct of 
order 1 


Correct of 
order 1 



Table 1: Summary of various approximations 



4 Application to Mixture Modelling and Clustering 

Now that we have shown how to compute maximum-likelihood parameter estimates, we proceed onto 
mixture-modelling for mW distributions. 

Suppose we observe the set X = {xi, . . . ,x n G P p_1 } of i.i.d. samples. We wish to model this set 
using a mixture of K mW distributions. Let W p (x\fij , Kj) be the density of the j-th mixture component, 
and 7Tj its prior (l < j < K) - then, for observation Xi we have the density 



The corresponding log-likelihood for the entire dataset X is given by 



f(x l \n 1 ,K 1 ,...,fl K ,K K ) = y ^ TTjWpjXilflj, Kj). 



C(X;ti 1 ,K 1 ,...,tJi K ,K K ) = ^^log^^^^TTjWpixilfij, k^. (4.1) 



To m aximise the log- likelihood, we follow a standard Expectation Maximisation (EM) procedure (|Dempster et al 



19771 ). To that end, first bound £ from below as 



WjW p (Xi\(J,j,K 

where /3ij is the posterior probability (for Xi, given component j), and it is defined by the E-Step 



C(X; Ml , Kl> . . . , » K , KK ) > J2. . N log njWp{X ; {tlj ' Kj) , (4.2) 

^— Pij 



TTjWpiXilfM^Kj) 



J2l^lW p {x t \fl h Kl) 
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Input: X — {a;i, ...,x„ : where each Xi £ P p_1 }, K: number of components 


Output: Parameter estimates Wj, fij, and Kj, for 1 < j < K 


Initialise -Kj , fij , Kj for 1 < j < K 


while not converged do 




{Perform the E-step of EM} 




foreach i and j do 






Compute flij using (|4.3[) (or via (|4,6|l if using hard-assignments) 




end 




{Perform the M-step of EM} 




for j = 1 to K do 












Compute fij using (|4.4p 






Compute Kj using (|4.5[) 




end 


end 





Algorithm 1: EM Algorithm for mixture of Watson (moW) 



Maximising the lower-bound (I4.2[) subject to fij fj,j — 1, yields the M-Step: 



Hj = s{ if Kj > 0, fij — s 3 p if Kj < 0, (4.4) 
= 9 _1 (l/2,p/2,rj), where /\, ii'S'fi,. (4.5) 

where denotes the eigenvector corresponding to eigenvalue Ai (where Ai > • • • > X p ) of the weighted- 
scatter matrix: 

Now we can iterate between (|4.3p . (|4.4I) and ()4.5[) to obtain an EM algorithm. Pseudo-code for such a 
procedure is shown below as Algorithm [T] 

Note: Hard Assignments. We note that as usual, to reduce the computational burden, we can 
replace can E-step (|4.3|) by the standard hard-assignment heuristic: 

p = jl> if 3 = axgmaXjvlogTTjv +logW p {x i \fi j >,K j ,), . 
' [0, otherwise. 

The corresponding M-Step also simplifies considerably. Such hard-assignments maximize a lower- 
bound on the incomplete log-likelihood, and yield partitional- clustering algorithms (in fact, we show 
experimental results in Section [5 . 21 where we cluster data using a partitional-clustering algorithm based 
on this hard- assignment heuristic). 



4.1 Diametrical Clustering 

We now turn to the diametrical clustering algorithm of iDhillon et al. (2003). and show that it is merely 
a special case of the mixture-model described above. Diametrical clustering is motivated by the need 
to group together correlated and anti-correlated data points (see Figure Q] for an illustration). For data 
normalised to have unit euclidean norm, such clustering treats diametrically opposite points equivalently. 
In other words, x lies on the projective plane. Therefore, a natural question is whether diametrical 
clustering is related to Watson distributions, and if so, how? 

The answer to this questi on will become appa rent once we recall the diametrical clustering algorithm 
(shown as Algorithm [2) of (jPhillon et all . 120031 ). In Algorithm [5] we have labelled the "E-Step" and 
the "M-Step" . These two steps are simplified instances of the E-step (|4.3p (alternatively I4.6J) and 
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Input: X — {a;i, . . . , as n : as* € P p_1 }, K: number of clusters 
Output: A partition {Xj : 1 < j < .ft'} of A*, and centroids fij 
Initialise fij for 1 < j < K 
while not converged do 
E-step: 

Set Xj <- for 1 < j < K 
for i = 1 to n do 

| A"j <r~ Xj U {cci} where j — argmax 1<h<K (a;f /i/j) 2 
end 

for j = 1 to K do 

A; = 12-,- i- ;,! 



end 



end 



Algorithm 2: Diametrical Clustering 
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Figure 1: The left panel shows axially symmetric data that has two clusters (centroids are indicated by 
'+' and 'x'). The middle and right panel shows clustering yielded by (Euclidean) K-means (note that 
the centroids fail to lie on the circle in this case) with K = 2 and K = 4, respectively. Diametrical 
clustering recovers the true clusters in the left panel. 



M-step (|4.4j) . To see why, consider the E-step (|4.3j) . If Kj — > oo, then for each i, the corresponding 
posterior probabilities /3y — > {0, 1}; the particular /3.y that tends to 1 is the one for which (fijxi) 2 is 
maximised - this is precisely the choice used in the E-step of Algorithm f2J With binary values for /3jj , 
the M-Stcp (14. 4p also reduces to the version followed by Algorithm [2j 

An alternative, perhaps better view is obtained by regarding diametrical clustering as a special 
case of mixture-modelling where a hard-assignment rule is used. Now, if all mixture components have 
the same, positive concentration parameter k, then while computing /3jj via (|4.6|) we may ignore k 
altogether, which reduces Algorithm [T] to Algorithm [2] 

Given this interpretation of diametrical clustering, it is natural to expect that the additional mod- 
elling power offered by mixtures of Watson distributions might lead to better clustering. This is indeed 
the case, as indicated by some of our experiments in Section 15.21 below, where we show that merely 
including the concentration parameter n can lead to improved clustering accuracies, or to clusters with 
higher quality (in a sense that will be made more precise below). 



5 Experiments 

We now come to numerical results to assess the methods presented. We divide our experiments into two 
groups. The first group comprises numerical results that illustrate accuracy of our approximation to n. 
The second group supports our claim that the extra modelling power offered by moWs also translates 
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into better clustering results. 



5.1 Estimating k 

We show two representative experiments to illustrate the accuracy of o ur approximations. The first 
set ( EI5.1.1j) compares our approximation with that of iBiiral et al.l ( 2007t ). as given by (|3.6j) . This set 
considers the Watson case, namely a = 1/2 and varying dimensionality c = p/2. The second set fi i5.1.2j) 
of experiments shows a sampling of results for a few values of c and k as the parameter a is varied. 
This set illustrates how well our approximations behave for the general nonlinear equation (|2.7[) . 



5.1.1 Comparison with the BBG approximation for the Watson case 

Here we fix a = 1/2, and vary c on an exponentially spaced grid ranging from c=10toc=10 4 . For 
each value of c, we generate geometrically spaced values of the "true" in the range [—200c, 200c]. For 
each choice of picked within this range, we compute the ratio r = 3(1/2, c, «*) (using Mathematica 
for high precision). Then, given a = 1/2, c, and r, we estimate k* by solving k s» g (1/2, c, r) using 
BBG{r), L(r), B(r), and U(r), given by (O, ([37f]) . (|3^]l . and (O, respectively. 




0.002 0.004 0.006 0.008 0.01 0.012 0.01 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 



Approximation BBG(r) 




Approximation BBGfi 



0.6 0.8 




Approximation BBG(r) versus ours, c - 1000 (r = 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 




0" c - -v-BBG(r) 



0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 



Figure 2: Relative errors \k - k»|/|k*| of BBG(r), L(r), B(r), and U(r) for c e {10, 100, 1000, 10000} 
as r varies between (0,1). The left column shows errors for "small" r (i.e., r close to 0), the middle 
column shows errors for "mid-range" r, and the last column shows errors for the "high" range (r ~ 1). 



Figure [2] shows the results of computing these approximations. Two points are immediate from 
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the plots: (i) approximation L(r) is more accurate than BBG{r) across almost the whole range of 
dimensions and r values; and (ii) for small r, BBG(r) can be more accurate than L(r), but in this case 
both U(r) and B(r) are much more accurate. 



5.1.2 Comparisons of the approximation for fixed c and varying a 



Approximation BBG(r) versus ours, k-2, c = 10, a e [,01c, ,99c] 



Approximation BBG(r) versus ours, k-200, c - 100, a g [,01c, ,99c] 




0.2 0.4 0.6 0.€ 

r (varies montonically with a) 



Approximation BBG(r) versus ours, k=500, c = 1000, a g [.01c, .99c] 





r (varies montonically with a) 
Apprpximation BBG(r) versus ours, k=1 500, c = 1000, ae [.01c, .99c] 




r (varies montonically with a) 



0.5 0.6 0.7 

r (varies montonically with a) 



0.8 0.9 1 



Figure 3: Relative errors of BBG(r), L(r), B(r), and U(r) for different sets of c and K values, as a is 
varied from 0.01c to 0.99c. 

In our next set of experiments, we chose a few values of c and k (see Figure [3]), and varied a linearly 
to lie in the range [0.01c, 0.99c]. Figure [3] reports the relative errors of approximation incurred by the 
various approximations. 

From the plots it is clear that one of L(r), B(r), or U(r) always yields results more accurate than 
BBG(r). The various results suggest the following rough rule-of-thumb: prefer U(r) for < r < a/ (2c), 
prefer B(r) for a/ (2c) <r< 2a/ ^fc and prefer L(r) for 2a/ ^/c < r < 1. 



5.2 Clustering using mW distributions 

Now we turn to our second set of experiments. Below we show results of two experiments: (i) with 
synthetic data, where a desired "true-clustering" is known; and (ii) with gene expression data for which 
previously axially symmetric clusters have been considered. 

For both our experiments, we c ompare moW (Algor ithm [T] with (I4.6[) for the E-step) against the 
diametrical clustering procedure of iDhillon et al. | (l2003h . The key aim of the experiments is to show 



that the extra modelling power offered by a mixture of mW distributions can provide clustering results 
better than plain diametrical clustering. 
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Table 2: Percentages of accurately clustered points for diametrical clustering vs. moW (over 10 runs). 
Since this is simulated data, we knew the cluster labels. The accuracy is then computed by matching 
the predicted labels with the known ones. In line with the theory, with increasing concentration the 
modelling power offered by moW shows a clear advantage over ordinary diametrical clustering. 



K.2 


Diametrical (avg/best/worst)-% 


moW (avg/best/worst)-% 


3 


52.65 / 56.50 / 51.50 


51.65 / 53.50 / 50.50 


10 


52.75 / 56.00 / 50.50 


54.10 / 57.00 / 50.00 


20 


57.60 / 64.00 / 51.50 


74.45 / 87.00 / 63.50 


50 


66.00 / 78.50 / 50.00 


99.50 / 99.50 / 99.50 


100 


71.20 / 81.00 / 55.00 


100.00 / 100.00 / 100.00 



5.2.1 Synthetic data 

We generated data that merely exhibit axial symmetry and have varying degrees of concentration around 
given mean directions. Since both the diametrical method as well as moW model axial symmetry they 
can be fairly compared on this data. The distinction comes, however, where moW further models con- 
centration (via k), and in case the generated data is sufficiently concentrated, this modelling translates 
into empirically superior performance. Naturally, to avoid unfairly skewing results in favour of moW, 
we do not compare it against diametrical clustering on synthetic data sampled from a mixture of mW 
distributions as moW explicitly optimises such a model. 

For our data generation we need to sam ple points f rom W p (n, fj,), for which we invoke a simplified 
version of the powerful Gibbs sampler of (Hoff, 2009) that can simulate Bingham-von Mises-Fisher 



distributions. We note here that Bingham distribution is parametrised by a matrix A, and to use it for 
sampling Watson distributions, we merely need to realise that A = Kfifi T . 

With the sampling code in hand, we generate synthetic datasets with varying concentration as 
follows. First, two random unit vectors Hi, /J-2 € P 29 are selected. Then, we fix k% = 3 and sample 200 
points from W^(k\, fXi)- Next, we vary ki in the set {3,10,20,50,100}, and generate 200 points for 
each value of ki by sampling from W p (k2, Hi)- Finally, by mixing the n\ component with each of the 
five K2 components we obtain five datasets X t (1 < t < 5). 

Each of these five datasets is then clustered into two clusters, using moW and diametrical clustering. 
Both algorithms are run ten times each to smooth out the effect of random initializations. Table[5]shows 
the results of clustering by displaying the accuracy which measures the percentage of data points that 
were assigned to the "true" clusters (i.e., the true components in the mixture). The accuracies strongly 
indicate that explicit modelling of concentration leads to better clustering as k,i increases. In other 
words, larger k,i makes points from the second cluster more concentrated around ±/^2, thereby allowing 
easier separation between the clusters. 

5.2.2 Real Data 

We now compare clustering results of moW with those of diametrical cluste ring on three g e ne mi croarray 



We now compare clustering results ot mow with those ot diametrical clustering on three gene microarray 
datasets that were also used in the original diametric al clustering paper (iDhillon et all 20031). T hese 



datasets are: (i) H uman Fibrobla s ts (llyer et al. . 19991 ): (ii) Yest Cell Cycle ( Spellman et al.Tl998l ): and 



(iii) Rosetta yeast ( Hughes et all l2000h . The respective matrix sizes that we used were: (i) 517 x 12 



(ii) 696 x 82; and (iii) 900 x 300 (these 900 genes were randomly selected from the original 5245). 

Since we do not have ground-truth clusterings for these datasets, we validate our results using 
internal measures. Specifically, we compute two scores: homogeneity and separation, which are defined 
below by i? avg and S avg , respectively. Let Xj C X denote cluster j; then we define 



j = l Xi£Xj 

Save = ^ , 1 ,, , Y\ \Xj\\ x i\ mm(njfii,-fj,j^i). (5.2) 
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Table 3: Clustering accuracy on gene-expression datasets (over 10 runs). Noticeable differences (i.e., 
> 0.02) between the algorithms are highlighted in bold. 



Method 


Diametrical (avg/best/worst) 


moW (avg/best/worst) 


Yeast-4 

Homogeneity 
Separation 


0.38 / 0.38 / 0.38 
-0.00 / -0.23 / 0.24 


0.37 / 0.37 / 0.37 
-0.04 / -0.23 / 0.20 


Yeast-6 

Homogeneity 
Separation 


0.41 / 0.41 / 0.40 
-0.06 / -0.15 / 0.14 


0.41 / 0.41 / 0.40 
-0.07 /-0.20 / 0.13 


Rosetta-2 

Homogeneity 
Separation 


0.16 / 0.17 / 0.16 
0.24 / 0.08 / 0.28 


0.16 / 0.17 / 0.16 
-0.20 / -0.28 / 0.09 


Rosetta-4 

Homogeneity 
Separation 


0.23 / 0.23 / 0.23 
-0.01 / -0.08 / 0.16 


0.23 / 0.23 / 0.23 
-0.03 / -0.09 / 0.12 


Fibroblast-2 

Homogeneity 
Separation 


0.70 / 0.70 / 0.70 
0.26 / -0.65 / 0.65 


0.70 / 0.70 / 0.70 
-0.01 / -0.65 / 0.65 


Fibroblast-5 

Homogeneity 
Separation 


0.78 / 0.78 / 0.78 
-0.05 / -0.28 / 0.40 


0.76 / 0.76 / 0.75 
-0.12 / -0.30 / 0.35 



We note a slight departure from the standard in our definitions above. In (|5.1|) . instead of summing over 
xffij, we sum over their squares, while in (|5.2[) . instead of fJ-ffii, we use min(/xj/x;, — fijfii) because 
for us +fij and —fij represent the same cluster. 

We note that diametrical clustering optimises precisely the criterion (|5.1|) . and is thus favoured by 
our criterion. Higher values of H avg mean that the clusters have higher intra-cluster cohesiveness, and 
thus are "better" clusters. In contrast, lower values of S &vg mean that the inter-cluster dissimilarity is 
high, i.e., better separated clusters. 

Table[3]shows results yielded by diametrical clustering and moW on the three different gene datasets. 
For each dataset, we show results for two values of K. The H avg values indicate that moW yields clusters 
having approximately the same intra-cluster cohesiveness as diametrical. However, moW attains better 
inter-cluster separation as it more frequently leads to lower S^g values. 



6 Conclusions 

We studied the multivariate Watson distribution, a fundamental tool for modelling axially symmetric 
data. We solved the difficult nonlinear equations that arise in maximum-likelihood parameter estima- 
tion. In high-dimensions these equations pose severe numerical challenges. We derived tight two-sided 
bounds that led to approximate solutions to these equations; we also showed our solutions to be accurate. 
We applied our results to mixture-modelling with Wat son distributions an d consequently uncovered a 



connection to the diametrical clustering algorithm of (jDhillon et al.l . 12003). Our experiments showed 
that for clustering axially symmetric data, the additional modelling power offered by mixtures of Wat- 
son distributions can lead to better clustering. Further refinements to the clustering procedure, as well 
as other applications of Watson mixtures in high-dimensional settings is left as a task for the future. 
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A Mathematical Details 



This appendix includes mathematical details supporting the technical material of the main text. While 
many of the facts are classic knowledge, some might be found only in specialised literature. Thus, we 
have erred on the side of including too much rather than too little. 

A.l Hypergeometric functions 

Hypergeometric functions provide one of the richest classes of functions in analysis. Indeed, any series 
with ratio of neighbouring terms equal to a rational function of the summation index is a constant 
multiple of the generalised hypergeometric function p F q defined by the power-series 



pFq(ai, ■ • ■ , Op] Ci, . . . , Cq, z) — S 



k>Q 



Cl' 



(A.l) 



where a k = a(a + 1) . . . (a + k — 1) is the rising factorial (often also denoted by the Pochhammer symbol 
(a)fc). Hypergeometric functions arise naturally as solutions to c ertain differentia,! equ ations; for a gentle 
introduction to hypergeometric functions we refer the reader to ([Graham et all 119981 ). while for a more 
advanced treatment the reader may find ([Andrews et al. , 1999) valuable. 

In this paper, we restrict our attention to Kummer's confluent hypergeometric function: \Fi, which 
is also denoted as M. Moreover, we limit our attention to the case of real valued arguments. 



A. 1.1 Some useful identities for M(a,c,x) 

We list below some identities for M that we will need for our analysis. To ease the notational burden, 
we also use the shorthand Mj = M(a + i 7 c + i, x); e.g., M = M (a, c, x). 



d n M _ a^_ M 

dx n c n 

c(l — c + x) „ r etc — 1) , , 
Mi = - -M + '-M-i. 

ax ax 

(c - a)M(a + l,c + 2,x) = (c+ 1)M X - (a + 1)M 2 
(c-a)xM(a + 2,c+3,x) = (c+ l)(c+ 2)[M 2 - Afi]; 

(a + l)xM 2 = (c + \){x - c)M x + c(c + 1)M 
xM(a + 2, c + 3, x) = (c + 2)[M 2 - M(a + l,c + 2,x)], 
Ident ity (|A.2[) follows inductively; (IA.3I) fro m dCuvt et al 



(A.2) 

(A.3) 

(A.4) 
(A.5) 
(A.6) 
(A.7) 



1953, formula 6.4(4)); (|A.5[) on combining (jErdelvi et al 



2008, 16.1.9c); (fO)| from ( Erdelvi et al 



19531 formula 6.4(5)) with (Erdelv i et al 
c+1, a -» a+1; and (|A.7|) from (jErdelvi et al 



1953 . formula 6.4(4)); (|A.6[) from (|A.3[) by replacing c 
1953, formula 6.4(5)). 

Now we build on the above identities to introduce a technical but crucial lemma. 

Lemma A.l. The following identity holds for the Kummer function: 



M{ ~M 2 M = 



(c — a)x 



-M(a + l,c + 2,x) 2 - 



-M(a + 2, c + 3, x)M{a, c + 1, x) 



c(c+l) 



M{a + 1, c + 2, x)M(a + 2, c + 2, x) 



(A.8) 
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Proof. Application of (|A.4|) and (| A.5|) yields after collecting terms 

— !— M(a + 1, c + 2, xf l — M (a + 2, c + 3, x)M (a, c+l,x) + - T — ^ — r M (a + 1, c + 2, a:)M 2 

c+1 v ; c+2 v 7 v ' c(c+l) ; 

a(a+l) ,,-2 c(c+l) (c+l)(a-s) 2 

= — : \2 M 2 - -^-M 2 M M t 

c(c — ay {c—a) z x (c — a) z x 

' C(C + l) -M lMo + QC(C + ± C ± 2QC) M 2 Mx. (A.9) 



(c — a) 2 x (c — a) 2 cx 

Application of this formula allows us to write the difference between the left-hand and right-hand sides 
of (|A.8P as 

lhs - rhs = cM i ~ aM 2 / (fl + _( c + y( x _ c)Mi _ c(c + 1)Mq ) = Qj 

c(c + l)(c — a) 

where the equality to is due to (|A~6l) . □ □ 
A. 2 The Kummer ratio 

The central object of study in this paper is the Kummer-ratio: 

*>-''-«->" ^- r^TO^ - (A ' 10) 

This ratio satisfies many fascinating properties; but of necessity, we must content ourselves with only the 
essential properties. In particular, our analysis focuses on the following: (i) proving that g is monotonic, 
and thereby invertible; and (ii) obtaining bounds on the root of g{x) — r = 0. In the sequel, it will be 
useful to use the slightly more general function 



M(a + jit, c + /i, x) 
M (a, c, x) 



Wi) := jfTTI : 3 , M > 0, (A.ll) 



so that g(x) = (a/c)fi(x). Before proving monotonicity of g, we derive two useful lemmas. 
Lemma A. 2 (Log-convexity). Let c > a > and x > 0. Then the function 

I> + m) ^ r( fl + M + fc) a; fc 

" ^ WT^ M(a + ^' + ' T) = ^ T(c + » + k) M =: K ^ X) 



is strictly log-convex on [0, oo) (note that h is a function of fi). 
Proof. Write the power-series expansion in x for /i ajC (/Li; x) as 

Vc(M! x )=Y] h k(a, c ;^)tt » M a > c ' M) = T-^TTT 

Since log-convexity is additive it is sufficient to prove that fi i— > hk(a,c,/j,) is log-convex. For this we 
compute the second- derivative 

a 2 

^-2 log/i fc (a, c, /x) = ip (a + /i + fc) - f/)'(c + /x + k), 

where ip is the logarithmic derivative of the gamma function. We need to show that this expression is 
positive when c > a > 0, k > and fx > 0. According to the Gauss formula ( Andrews et al. , 19991 
Theorem 1.6.1) 



^) = I \ — ~ ) 
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so that 



r"u)= I L£— t dk<Q. 



Hence the function ip'(x) is decreasing and our claim follows. □ □ 

Lemma A. 3. Let c > a > 0, and x > 0. Then the function 

A* !-> "ft — ; — r Af(c - c + fi, x) =: h a c (fi; x) 
is strictly log-convex on [0,oo). 

Proof. Using precisely the same argument as in the proof of Lemma I A. 2 1 we see that /i i— > T(a + fi)/T(c + 
/i) is log-convex. Next, the log-convexity of fx i-> M(c — a; c + \i; x) has been proved by many authors 
(see, for instance. iKarp and Sitnik ( 201ot ) and references therein). Thus multiplicativity of log-convexity 
completes the proof. □ □ 

With these two lemmas in hand we are ready to prove the first main theorem. 

Theorem A. 4 (Monotonicity) . Let c > a > 0. The function x i— > f^{x) is monotone increasing on 
(—00,00), with fn(— 00) = and /^(oo) = T(c + /i)T (a) / (T (c)T (a + /i)) . 

Proof. We divide the proof into two cases: (i) x > 0, and (ii) x < 0. 
Case i: Let x > 0. It follows from (|A~2|) that 

-^-M(a, c,x) = -M(a + l,c+ l,x), 
da; c 

whereby (using our compact notation) we have 

We need to show that the above expression is positive, which amounts to showing 

a + nM^+i a Mi 

c + /i Mf, cM ' 1 ' ' 

or equivalently 

[r(g + fi + l)/r(c + /i + l)]M M+ i [T(a + l)/r(c+l)]Mi 

[r(a + M)/r(c + M)]A/ M > [r(a)/r( c )]M ■ 

The last inequality follows from Lemma IA.21 To see how, recall that if p, i-> h(fi) is log-convex, then 
the function /j, 1— > h(fi + 5)/h(fi) is increasing for each fixed 6 > 0. Thus, in particular applying this 
property to h a ^ c (fi; x) with S = 1 we have 

h a ,c{n + /t a , e (l;x) 
h a ,c{.^]x) /i o , c (0;:e)' 

which is precisely the required inequality. Thi s estab lishes the monotonicity. The value of / M (oo) follows 
from the asymptotic formula ( Andrews et al. . 19991 Corollary 4.2.3): 

M(a, c, x) ~ ^ 2 F (c - a, 1 - a; -; 1/x), x -> 00. (A.13) 

1 (a) a; c_a 



2 Easily verified by noting that when h is log-convex, its logarithmic derivative h'(fi)/h(fi) is increasing, which imme- 
diately implies that the derivative of h(/i + 5)/h(/i) is positive. 
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Case ii: Let x < 0. Like in Case (i) we need to show that 

[Fja + /x + 1)/L(c + /x + 1)]A/ M+1 [L(g + 1)/L(c + lJjM, 

[r(a + /x)/r(c + / i)]M M > [r(a)/r( c )]M ■ 1 ■ > 

but this time for x < 0. Apply the Kummer transformation M(a; c; a;) = e x M{c — a; c; —x) and write 
y = —x > to get 

[r(o + n + l)/r(c +£ + l)]M(c - o; c + // + 1; y) [L(a + 1)/L(c + 1)]M (c - a; c + 1; y) 
[r(a + /z)/r(c + M)]M(c- a ;c + M;y) > [T(a)/r(c)]M(c - a; c; y) 

Using the notation introduced in Lemma IA.3I the last inequality becomes 

h a ,c(fJ- + 1; x) hg tC (l;x) 
h a ,c(p;x) h a>c (0;x) 

which holds as a consequence of the log-convexity of fi i-> /i QiC (/i; x). Finally from the Kummer trans- 
formation and formula (|A. 131) we have 



M(c- a;c + fi;-x) T(c + fi) 1 2 F (a + /i, 1 + a - c; -; -I x) 

tu(x) = ; r ~ — : ; > V 'AS X —> ~00. U 

M(c-a;c;-x) T(c) (-xY l 2 F (a, 1 + a - c; -; -l/x) 

(A.15) 
□ 



A. 3 Bounds on n 

We now derive bounds on k(t), the solution to the equation y(a, c; k) = r. For ease of exposition we 
divide the bounds into two cases: (i) positive n(r) and (ii) negative n(r). 

Theorem A. 5 (Positive n). Let n(r) be the solution to \2. 7\ ); c > a > 0, and r G (a/c, 1). Then, we 
have the bounds 



rc — a/_ 1 — r\ , , rc — a I „ / 4(c + l)r(l — r) \ rc — af, r\ ,. , „. 

1 + < «W < ^-7- r 1 + 4/1+ — t 1 ^—, < -jz r (l + - ■ (A.16) 



r(l — r) \ c — a J 2r(l — r) I y a(c — a) / r(l — r) V a 

Proof. Lower-bound. To simplify notation we use x — n(r) below. Denote n = y(a + 1, c + 1; x). Now, 
replace a <— a + 1, c-s— c+1 and divide by Mi in identity (|A.3|) to obtain 

cr — a , . , 

z = — -, A.17 

r(l - ri) 

where as before r = g(a, c, a;). To prove the lower bound in (|A.16|) we need to show that 

cr — a cr — a cr — a 
> 



r(l — 7*1) r(l — r) r(c — a) 
Elementary calculation reveals that this inequality is equivalent to 

(c- a - l)r + 1 

TT <ri ' 

c — a + 1 — r 

once we account for cr — a > by our hypothesis. Plugging in the definitions of r and r\ we get: 

(c - a - l)oAfi + cM (a+l)M 2 
(c - a + l)cM - aMi (c + l)Mi ' 

where as before we use Mi = M(a + i,c + i,x). Cross-multiplying, we obtain 

h(x) := c(c - a + l)(a + l)M 2 M - (c + l)(c - a - l)oAf^ - c(c + l)M x Mo - 0(0 + 1)M 2 M 1 > 0. 
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Now on noticing that c(c— a + l)(a + l) = ac(c—a) + c(c+l) and (c—a—l)(c+l)a = ac(c—a) — a(a+l), 
we can regroup h(x) to get 

/i(x) = ac(c - a)[M 2 M - M\\ + (M 2 - Mi)[c(c + 1)M - o(a + 1)MJ. 

Now identity (IA.5I) yields 

(c+ lj(c+ 2) 

and a simple calculation shows that 

dX { C — (1) 

c(c + 1)M - a(a + l)M x = — -M(a + 1, c + 2, x) + (c - o)(c + a + 1)M (a, c + 1, x). 

c + 1 

Substituting these formulae into h(x) and using identity (|A.8[) we get a rather complicated term 



, . . ac(c — a) 2 x 

h{x) = ^Ti - 



^M(a|l,c + 2,if - —^M(a + 2,c + 3,x)M(a,c+l,x) 



+ . 1 -t M (a + 1, c + 2, x)M 2 
c(c + 1) 



I \1 1 

" ' ~"' '' M(a + 2,c + 3,x)M(a + l,c + 2,x) 



(c+l) 2 (c + 2) 
(c — a) 2 (c + a + l)x 



M(a + 2,c+3,x)M{a,c+l,x). (A.18) 



( C + l)(c + 2) 

However, on invoking the contiguous relation (|A.7|) . h(x) can be simplified considerably to yield 

^ + = ( a + 1 )( c + 1 ) M(Q; c + ! x)M(a + 2 , c + 3, - aM(a + 1, c + 2, x) 2 . (A.19) 

(c — aj^rr c + 2 

Therefore, the condition h(x) > is equivalent to (after introducing the notation c = c + 1) 

c' a 

M(a, c', x)M(a + 2, c' + 2, x) M(a + 1, c' + 1, x) 2 > 0, (A.20) 



or, in other words 

(a + 1) M(a + 2, c' + 2, x) ^ a M(a + 1, c' + 1, x) 



(c' + l)M(a + l,c' + l,x) c' M(a,c',x) 

But this final inequality follows from Theorem IA.4I by using /j, = 1 in (|A.12[) . 

Upper-bound. To prove the upper bound, first for brevity introduce the notation b — c — a, and 
q = 1 — r. The lower-bound in (|A.16|) can be then rewritten as (b — cq — cr — a > 0) 

b - c 1 ( X + l\ <x , 



9(1 - q) 

which in turn can be rearranged to 

q 2 (x - c/b) - q(x + c - 1) + b < 0. 

Note first that the equation q 2 (x — c/b) — q(x + c — 1) + b = has two distinct real roots since the 
discriminant (upon using b — c — a) 

D = (x + c - l) 2 - 4(c - a)(x - c/(c - a)) = (1 - x - c) 2 + Aax + 4c(l - x) = (1 - x + c) 2 + 4ax > 0. 

We need to consider three cases: 

(1) x — c/b > implies q lies between the roots 

b(x + c - 1) bVT> b(x + c- 1) bV~D 

< q < 



2{bx-c) 2(bx-c) 2(bx-c) 2{bx - c) 
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(2) x — c/b < implies q is smaller than the smaller root or bigger than the bigger root, i.e. 

b(x + c-l) by/D b(x + c-l) bVD 

q < 2{bx - c) + 2{bx - c) ' ° r 2{bx ~ c) 2{bx - c) < 9 ' 

(3) x — c/b = implies b/(x + c — 1) < g. 
Since 

lim 



b(x + c - 1) bVD 



>c/b \ 2{bx - c) 2(6x - c) J x + c-1 

in all three situations we have 

b(x + c - 1) - b^/D 

2{bx - c) <Q ' 
Changing a^a + 1 and c— !>c + 1 here (recall that b — c — a) we get 



0< K* + °)-w*+°r-^-c-i) k 

2(bx — c — 1) 

where as before r\ = g(a + 1, c + 1, x). The positivity is clear on inspection for both bx — c — 1 > and 
&x — c — 1 < 0. Next, after suitably rewriting (|A.17|) . we have 

b — cq 



(l- g )(l-n) 

Applying inequality (IA.21[) here, we obtain 

2(b - cq)(bx - c - 1) 



x < 



(1 - q)b(x + c - y/(x + c) 2 - 4(bx - c - 1)) 

Squaring and simplifying we get the inequality 

(bx-c- l)(x 2 q(l - q)b{c - b) - xb(b - cq)(c - b) - (c + 1)(& - cq) 2 ) < 

for bx — c — 1 > and the inequality 

(bx - c - l)(x 2 q{l - q)b(c - b) - xb(b - cq)(c -b)-(c+ l)(b - cq) 2 ) > 

for bx — c — 1 < 0. Hence both situations reduce to the single inequality 

x 2 q(l - q)b(c -b)- xb(b - cq)(c - b) - (c + 1)(6 - cq) 2 < 0, 

which on plugging q = 1 — r and b = c — a becomes 

x 2 r(l — r)a(c — a) — xa(c — a)(cr — a) — (c + l)(cr — a) 2 < 0. 

Since the coefficient at x 2 is clearly positive x must lie between the roots, in particular it should be 
smaller than the bigger root, which is the upper bound in (IA.16I) . 

The rightmost bound. Verifying the rightmost inequality is an exercise in high-school algebra. □ 



Theorem A. 6 (Negative k). Let n(r) be the solution to {2.1^ , c > a > 0, and r lie in (0,a/c). Then, 
we have the following bounds: 



r(l — r) 



/ 1 — r\ rc — a I I 4(c + l)r(l — r) \ , . rc — af^ r\ , k . 
( 1 + ^) < 2Kl-7) 1 + a(c-a) ) < < r)< Hl^)( l+ a) ^ 
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Proof. Upper bound: To simplify notation we write as before x — n(r). Recall that r\ = ,a(a + l, c+1; x); 
according to (|A.14[) . the value r\ > r; so in view of cr — a < and (|A.17[) . we have the inequality 

cr — a 
r(l — r) ' 

which is equivalent to (noting that x < 0) 




Thus, r must lie between the roots of the quadratic 

, 2 +(£-l) r -^=0. 
\x ) x 

Straightforward analysis shows that the discriminant is positive for all real x if c > a > 0, and we have 
two distinct real roots so that 

1 c 1 / 2(2a- c) c 2 " 1 c 1 / 2(2a-c) C 2 " , A , 

2-^-2V 1 + ^-^ + ^ <r< 2-2^ + 2V 1 + A ^ + ^ (A ' 23) 

We will use the lower bound above written here for n = <?(a + 1, c + l;x) 
„ 1 c+1 1 r 2(2a-c+l) (c+1) 2 

Applying this lower bound for ri to (|A.17[) and dividing by 2x < 0, we obtain the bound 

2(cr - a) 

r(l + (c + \)/x + 2(2a - c + l)/x + (c + l) 2 /^) ' 

By high school algebra we immediately conclude that the denominator is positive, so that 

(-x)y/l + 2(2a -c+l)/x+{c+ l) 2 /x 2 < (x + 1 - c) + 2a/r 

Squaring and rearranging we obtain the desired upper bound in (|A.22|) . 

Lower bounds: First we prove the leftmost bound in (IA.22|) (i.e., we show that it less than x). Since 
r € (0, a/c) we have cr — a < 0; so, following the line of proof of Theorem I A . 5 1 we get 

(c - a - l)r + 1 

ZT > 7-1 

c — a + 1 — r 

which leads to h(x) < where h is as defined in the course of proof of Theorem IA.5I However, since 
x < for r G (0, a/c) we must again show that (where c' = c + 1) 

(a + l)M(a + 2; c' + 2; x) aM (a + 1; d + 1; x) 
(c' + l)M(a+ l;c' + l;x) > c 7 ]!?^ 7 ^) ' 

but this time for x < 0. Applying the Kummer transformation to the inequality above leads to 

[T(o + M+ l)/r(c + M+ 1)]M (c - a; c + /x + 1; y) [T{a + l)/^(c , + 1)]M (c' - a; c + 1; y) 
[r(a + /i)/r(c + /i)]M(c-a;c+/i;y) > [T{a)/T{d)]M(d - a; c'; y) 

with /i = 1 and y — — x > 0. This inequality follows immediately from Theorem IA.3I as we have 
demonstrated in the proof of the previous theorem. 

Proving the second (and tighter) lower bound in (|A.22|) requires more work. Introduce thus the 
notation b — c — a and q = 1 — r. The leftmost bound in (|A.22|) can be rewritten as 

iir £ i)( 1 + !)< 1 - 
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which can be rearranged to 

q 2 (x - c/b) - q(x + c - 1) + b < 0. 
The discriminant of the corresponding quadratic equation equals 

D = (x + c - l) 2 - 4(c - a)(x - c/(c -a)) = (x + c- l) 2 - 4(c - a)x + 4c> 0, 

since c — a > and x < 0. For x — c/b < this implies that q must be smaller than the smaller root or 
bigger than the bigger root of q 2 (x — c/b) — q(x + c — 1) + b = 0. That is, 

b(x + c-l) by/(x + c - l) 2 - 4(6x - c) 
q K 2(bx - c) + 2{bx - c) ' 

or 



b(x + c-l) b^J(x + c - l) 2 - 4(6x - c) 
2{bx - c) 2 (bx - c) < q ' 

Changing a— >a + 1 and c— + 1 here (recall that b = c — a), from the second inequality we obtain 



+ c) - V(.t + c )2 - 4(6x - c - 1)] 

< ^71 T\ < 1 - n, 

2(6x — c — 1) 

where as before rj = (/(a + 1, c + 1, x). Positivity follows by separately considering x + c > and 
x + c < 0. Next, a simple manipulation of (|A.17[) shows that 

b — cq 

X= (i-q^i-nY 

Applying the above inequality concerning r\ here, we obtain the bound (since b — cq < 0) 

2(b - cq)(bx - c - 1) 



x > 



6(1 - q)[(x + c) - y/(x + c) 2 - 4(6x - c - 1)] ' 



Here b— cq < (since r < a/c), bx—c—1 < (since x < 0, b > 0) and (x+c)— y (x + c) 2 — 4(6x — c — 1) < 
(since the second term is bigger than the first). So the above bound on x is negative as expected. By 
simple algebra and in view of the signs we have 

2(b-cq)(bx-c - 1) n r= ; 

x + c - -i ^ ^ > V (x + c) 2 -4(bx-c- 1) > 0. 

xo(l — 

Squaring yields 

x 2 6 2 (l - q) 2 (x + cf - A{b - cq){bx -c~ l)xfe(l - q){x + c) + 4(6 - cg) 2 (6x - c - l) 2 
> ((x + c) 2 - 4(6x - c - l))x 2 6 2 (l - g) 2 , 
whereby reducing similar terms and on dividing by — 4(6x — c — 1) > we obtain 

(6 - cq)xb{\ - q)(x + c) - (6 - cg) 2 (6x - c - 1) - x 2 6 2 (l - q) 2 > 0. 
Simplifying we get 

x 2 q{\ - q)b(c - 6) - x6(6 - cq)(c - 6) - (c + 1)(6 - cq) 2 < 0, 

so that after plugging in q — 1 — r and 6 = c — a we have the inequality 

x 2 r(l — r)a(c — a) — xa(c — a)(cr — a) — (c + l)(cr — a) 2 < 0. 

Thus, x must be between the roots of this quadratic. The fact that it is bigger than the smallest root 
is precisely the inequality between x and the middle term in (|A.22[) . 

Finally, comparing the two lower bounds in (IA.22[) leads to the inequality 



1 + \__r_ > 1 / + 4(l-r)r(c + l) 

2 c — a 2y (c— a)a 

which upon squaring and simplifying reduces to r < a/c, the hypothesis of the theorem. □ □ 
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In the next theorem we find the asymptotic expansions of the solution to g(a, c;x) = r around r = 0, 
r = a/c and r = 1. 

Theorem A. 7. Let c > a > 0, r £ (0, 1); let x(r) be the solution to g(a, c; x) — r. Then, 



a . (c — a — + „, 9s 

a;(r) = — + (c-o-l) + - A V + 0(r 2 ), r -> 0, 

r a 



a\ f c 2 (l + c) c 3 (l + c) 2 (2a-c) 
x(r) = I r 1 ' 



c7 l_a(c— a) a 2 (c — a) 2 (c + 2) 



('- 3 +°((' -;)")}• 



a 

r — s- - 

c 



(A.24) 
(A.25) 

, (A.26) 

1 — r c — a 

Proof. The fi rst step is the standard d ivision of power series (see, for instance, ( Gonzalez . 199l[ The- 
orem 8.8) or ([Hurwitz. Courant I . Il925l Chapter 2.3)) which yields in a neighbourhood of x — 0: 



/ \ c— a „ (a — l)(a — c — 1) ,„ . 
z(r) = h 1 - a + - -(1 - r) + 0((1 - r) 2 ), r -> 1. 



. , aM(a+l,c+l;a;) a a(c— a) a(a — c)(2a — c) „, 
ff(a;) = v —. = - + '-x + ^— fx 2 + 0(a; 3 ). 



cM(a, c; a;) 



c c 2 (l + c) c 3 (l + c)(2 + c) 



(A.27) 



Next, the division of the asymptotic formulas (|A.13j) used with appropriate parameters gives in the 
neighbourhood of x = oo: 

= aM(a + l,c+l;x) = 1 +(o _ c)(1 _ fl) 1 +(1 _ a)(c _ a)(2a _ c _ 2) J_ +0(a .-4). (A .28) 

cM (a, c; x) x x z a; 13 

The inversion of a p ower series in a neighbou rhood of a finite point xq is achieved via the following 
formula of Lagrange ( Hurwitz. Courant . 19251 Chapter 7, Theorem 1*): 

oo 

x(r) = xq + \ lim 
n=l 

where ro = 3(2:0) and g'(xa) 7^ 0. Introducing the notation 

g(x) -r = gi(x - x ) + g 2 (x - x ) 2 + g 3 (x - x Q ) 3 H 

for the Taylor coefficients of g, we obtain (see ((Gonzalez . 199ll (8.14.11))): 

x(r) - xq = — (r - r ) g-(r - r ) H § (r - r )° + 0{(r - r ) ). 

9i 91 91 





( x -x \ " 


(r - r ) n 


dx 71 - 1 


\g(x) ~r J _ 


n\ 



(A.29) 



In our case we have expansions in the neighbourhoods of xq = and xq — 00. For the case Xq = 
formula (|A.29[) immediately reveals: 



x(r) — (r — a/c) 



c 2 (l + c) c 3 (l + e) 2 (2fl-c) 
a(c-a) a 2 (c - a) 2 (c + 2) 



(r-a/c) + 0{{r-a/c) 2 ) 



(A.30) 



which is precisely formula (|A.25[) . For the point at infinity the Lagrange formula does not have the 
form given above. To compute the correct expression introduce the new variable y = 1/x and rewrite 
the expansion (|A.28[) in the form: 



r = Kv) = fi l /y) = qo + qiy + qiy 1 + 



Then according to (|A.29|) : 



92 , 



V = ~(r - qa) *(r - Qo 
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(r- qo f + 0((r~q ) 4 ). 



Again applying standard division of power series (see, for instance (jGonzalez I . Il99ll Theorem 8.8) or 
(|Hurwitz. Courant I . Il925l Chapter 2.3)) we get: 



x(r) 



1 
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32 ,1 / Qa q\ 



y(r) r- q q 1 q x \q x qf 



(r - qo) + 0((r - q ) 2 ) 
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Substituting here the values of qt from (|A.28|) . we obtain the expansion: 



x(r) — 



1-r 



(c - a) + (1 - a)(l - r) + (& ° - r) 2 + 0((1 - r) 3 ) 



(A.31) 



This proves formula (|A.26[) . 

From formula (|A.15j) we immediately derive 

o 0(1 + 0-0) a(l + a-c)(2 + 2a-c) 4 
3 (a; c; as) = — + ( __ x)2 + + 0(l/(-x) ) 

Introducing y = — 1/x we rewrite this as 

r = ay + o(l + a - c)y 2 + a(l + a - c)(2 + 2a - c)y 3 + 0(y 4 ) = ?0 + gi» + g 2 y 2 + QsV 3 + 0{y 4 ) 
Then according to (|A.29|) : 



y = — (r - go) — 3(7- - go) + 



5i 



9i 



(r- go ) 3 + 0((r- go ) 4 ). 



Again applying standard division of power series (see, for instance ([Gonzalez I . I199U Theorem 8.8) or 
(jHurwitz. Courant I . Il925l Chapter 2.3)) we get: 



x(r) — 



-7-7- (r-qo) + 0((r- qo y) 

y(r) r-q qi q\ \qi qf / 



or 



x(r) 



a a(l + a-c) 1 f a(l + a - c)(2 + 2a - c) a 2 (l + a-c) : 



y(r) r 



r + 0(r 2 ) 



a . / ^ . (c- a -!)(! + a) 2 

h (c — — 1) H r + C>(r ), 

r a 



which completes the demonstration of (IA.24|) . 



□ 
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